SUMMARY

It is sometimes necessary to estimate correlations in samples that have been range restricted
due to selection. These correlations are often diminished when compared to their population
values. A correction for this circumstance is the subject of a proof which is discussed in the
context of a computer-aided simulation procedure to study the nature and behavior of the
correction.


TOOL FOR STUDYING THE EFFECTS OF RANGE RESTRICTION
ON  CORRELATION  COEFFICIENT  ESTIMATION 


I.  INTRODUCTION 

Let  X  and  Y  be  two  random  variables  defined  on  a  population  P.  Let  Q  be  a  sub-population 
of P and suppose that we have a random sample selected from Q. $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ will
denote the X and Y data collected from this sample. The traditional statistic for estimating the
correlation between X* and Y* [$\rho_{X^*,Y^*}$] is

$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\, s_X s_Y}, $$

where $\bar{X}$ and $\bar{Y}$ are the sample means and $s_X$ and $s_Y$ are the sample standard deviations.

However, this may not be a very good estimate of $\rho_{X,Y}$. The need to know $\rho_{X,Y}$ when you have
only a random sample from Q is a problem that occurs quite naturally and it has been
investigated  for  some  time.  The  most  widely  used  method  to  deal  with  it  has  been  to  use  a 
correction  formula  first  developed  by  Pearson  (1903),  and  then  extended  by  Lawley  (1943). 
This  formula  applies  when  certain  assumptions  are  satisfied.  These  assumptions  are  basically 
the  classical  linear  regression  model,  and  will  be  described  in  detail  later.  The  formula  applies 
only when $\rho_{X^*,Y^*}$ is known exactly. It is not uncommon to take a formula that holds for population
parameters  and  apply  it  instead  to  statistics  used  to  estimate  those  population  parameters. 
Unfortunately,  this  approach  comes  with  no  guarantees.  It  is  not  assured  to  provide  an  unbiased 
or  even  a  very  good  estimation.  Finding  a  mathematical  description  for  the  sampling  distribution 
of  the  Pearson  statistic  appears  to  be  very  difficult.  At  least  it  has  defied  solution  so  far. 
Rather  than  seeking  a  mathematical  solution,  we  decided  instead  to  take  a  computational 
approach  and  write  a  Monte  Carlo  simulation  program.  The  purpose  of  this  program  is  to 
evaluate,  under  varying  conditions,  the  accuracy  of  the  traditional  r  statistic  and  of  the  Pearson 
statistic in estimating $\rho_{X,Y}$. It will also be useful for testing statistics that use correlation
coefficients  as  inputs. 


II.  NOTATION  AND  OBJECTIVES 

We  will  use  the  notation  of  Lord  and  Novick  (1968).  Assume  that  the  members  of  an 
organization  were  admitted  to  the  organization  by  virtue  of  having  passed  a  battery  of  tests. 
These  members  are  called  the  selected  group  or  the  restricted  population  and  will  be  denoted 
by  Q.  These  members  plus  those  that  were  denied  entry  constitute  the  applicant  group  or  the 
unrestricted population and will be denoted by P. The tests that were used as a basis for selection
are  viewed  as  random  variables  on  P  and  are  called  the  explicit  selection  variables.  Any  other 
tests  that  are  given  to  the  members  of  the  selected  group  Q  are  called  incidental  selection 
variables.  All  random  variables  are  assumed  to  be  defined  on  P.  If  X  is  a  random  variable  on 
P  then  the  restriction  of  X  to  the  selected  group  Q  will  be  represented  by  the  notation  X*. 

Our  objective  is  to  study  the  sampling  distribution  of  two  statistics.  The  first  is  the  standard 
sample  correlation  coefficient,  r,  which  is  calculated  using  a  random  sample  from  Q.  The  second 
is  the  Pearson  correction  formula  for  range  restriction.  It  is  calculated  using  the  sample 
covariance  matrix  for  the  explicit  selection  variables  based  on  data  from  the  applicant  group 
P  plus  the  sample  covariance  matrix  for  all  variables  based  on  the  selected  group  Q. 

In this study it is assumed that the most general type of selection criterion is

$$ l \le c_1 X_1 + \cdots + c_{nve} X_{nve} \le h, $$

where h may be infinity, l may be negative infinity, and nve is the number of explicit selection
variables.
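
A minimal sketch of how such a selection criterion might be checked for one applicant is given below. The function name `is_selected` and its parameter names are hypothetical and are not taken from the program described later; the bounds default to minus and plus infinity, which corresponds to a one-sided or absent restriction.

```python
import math


def is_selected(scores, coeffs, low=-math.inf, high=math.inf):
    """Return True when low <= sum(c_i * x_i) <= high.

    scores: values of the explicit selection variables for one applicant.
    coeffs: the coefficients c_1, ..., c_nve (hypothetical argument name).
    """
    composite = sum(c * x for c, x in zip(coeffs, scores))
    return low <= composite <= high


# One explicit selection variable with the restriction X >= 0.0.
print(is_selected([0.5], [1.0], low=0.0))
print(is_selected([-0.1], [1.0], low=0.0))
```

Several restrictions can be handled by calling the function once per restriction and requiring every call to return True.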


III.  CORRECTION  IN  THE  TWO-VARIABLE  CASE 


The  correction  formula  for  range  restriction  is  usually  referred  to  as  the  Pearson  correction 
formula.  However,  it  was  Lawley  (1943)  who  established  the  minimum  assumptions  necessary 
for  its  application.  In  order  to  understand  Lawley's  theorem,  it  is  necessary  to  look  at  a  couple 
of  special  cases  in  this  and  the  next  section.  The  proof  of  the  general  theorem  is  by  generating 
functions and is not a very instructive proof. The proof of the present special case, however,
is instructive and an outline of this proof will be given.

Let  X  be  the  only  explicit  selection  variable  and  let  Y  be  the  only  incidental  selection 
variable.  Hence  we  have  X  and  Y  defined  on  P,  and  X*  and  Y*  defined  on  Q;  and  the  members 
of  Q  are  selected  on  the  basis  of  their  X  scores. 

Assumption  1.  (Linearity)  The  true  regression  function  of  Y  on  X  is  linear.  In  other  words, 

Y  =  a  +  bX  +  E, 

where a and b are constants, E is a random variable, and the expected value of E given x is zero, for all x.

Note: It is not necessary to assume that X and E are independent. Linear regression is enough to imply that
cov(X, E) = 0, which is needed for the proof of theorem 1. The proof that cov(X, E) = 0 follows directly
from the definition of covariance and hence is omitted.


Assumption 2. (Homoscedasticity) The conditional variance of Y given x does not depend on x. In other
words, $\sigma^2_E$ does not depend on x.

Note:  Assumption  2  still  does  not  imply  that  X  and  E  are  independent. 

Theorem 1. Under assumptions 1 and 2

$$ \frac{1}{\rho^2_{X,Y}} = 1 + \frac{\sigma^2_{X^*}}{\sigma^2_X}\left(\frac{1}{\rho^2_{X^*,Y^*}} - 1\right). \tag{1} $$


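proof: Given that cov(X, E) = 0, it is a matter of simple algebraic manipulation and the relationship

$$ \operatorname{cov}\Big(\sum_i a_i X_i,\ \sum_j b_j Y_j\Big) = \sum_i \sum_j a_i b_j \operatorname{cov}(X_i, Y_j) $$

to show that

$$ b = \rho_{X,Y}\,\frac{\sigma_Y}{\sigma_X} \tag{2} $$

and

$$ \sigma^2_E = \sigma^2_Y\,(1 - \rho^2_{X,Y}). \tag{3} $$

But now assumption 1 and Equation 2 imply that

$$ \rho_{X,Y}\,\frac{\sigma_Y}{\sigma_X} = \rho_{X^*,Y^*}\,\frac{\sigma_{Y^*}}{\sigma_{X^*}}, \tag{4} $$

while assumption 2 and Equation 3 imply that

$$ \sigma^2_Y\,(1 - \rho^2_{X,Y}) = \sigma^2_{Y^*}\,(1 - \rho^2_{X^*,Y^*}). \tag{5} $$

These two equations are exactly equivalent to the conclusion of the theorem. That is to say, you get the
conclusion by solving for $\sigma^2_{Y^*}$ in Equation 4, putting that in Equation 5 and solving for $\rho^2_{X,Y}$.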

It  is  important  to  understand  that  the  conclusion  depends  exactly  on  linearity  and 
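homoscedasticity and the fact that Y is not explicitly restricted. No assumption of normality is
needed. The population parameters that appear on the right side of the formula will, of course,
not be known and so the statistic based on this theorem becomes

$$ \frac{1}{\hat{\rho}^2_{X,Y}} = 1 + \frac{s^2_{X^*}}{s^2_X}\left(\frac{1}{r^2_{X^*,Y^*}} - 1\right), $$

with each population parameter in Equation 1 replaced by its sample estimate. This is the
Pearson statistic for two variables. The sampling distribution of this statistic does depend on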
the  joint  distribution  of  X  and  Y.  The  simulation  program  described  later  assumes  that  this  distribution  is  the 
bivariate  normal.  From  looking  at  a  few  examples  using  that  program  it  appears  that  the  corrected  statistic 
is  always  a  slight  underestimate.  In  the  cases  examined,  this  downward  bias  seems  to  be  so  small  that  it 
could  easily  be  ignored. 
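
As an illustration only, the two-variable correction can be computed as sketched below. The function name and arguments are hypothetical. Equation 1 fixes only the square of the corrected value, so the sketch assumes the corrected correlation takes the sign of the restricted one, which is consistent with the regression slope being unchanged by selection on X.

```python
import math


def pearson_correct_two_var(r_restricted, sd_x_restricted, sd_x_unrestricted):
    """Two-variable range-restriction correction (Equation 1 with sample values).

    r_restricted:      correlation of X and Y in the selected group.
    sd_x_restricted:   standard deviation of X in the selected group.
    sd_x_unrestricted: standard deviation of X in the applicant group.
    """
    if r_restricted == 0.0:
        return 0.0
    variance_ratio = (sd_x_restricted / sd_x_unrestricted) ** 2
    inverse_rho_squared = 1.0 + variance_ratio * (1.0 / r_restricted ** 2 - 1.0)
    magnitude = 1.0 / math.sqrt(inverse_rho_squared)
    # Assumption: the sign is carried over from the restricted correlation.
    return math.copysign(magnitude, r_restricted)


# A restricted correlation of 0.5 with the spread of X cut to 60 percent.
print(round(pearson_correct_two_var(0.5, 0.6, 1.0), 4))
```

When the two standard deviations are equal the function returns the restricted correlation unchanged, as Equation 1 requires.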

Notice that if $\sigma_{X^*} < \sigma_X$, as it will be for the type of restrictions we are considering,
then $\rho^2_{X^*,Y^*} < \rho^2_{X,Y}$. So if $r^2_{X^*,Y^*}$ is used instead of the correction formula, this gives
an estimate for a parameter that is smaller than $\rho^2_{X,Y}$. This last statement is not true when
there are more than two variables. This will be discussed in the next section.

The  effect  of  range  restriction  on  population  parameters  in  the  two-variable  case  is  easily 
visualized.  Think  of  a  correlation  coefficient  as  a  measure  of  how  well  we  can  perform  the 
following  task.  Administer  test  X  to  two  randomly  chosen  individuals  and  predict  which  of  these 
individuals  would  score  highest  on  test  Y.  There  are  three  characteristics  of  the  joint  density 
function  between  X  and  Y  that  determine  how  well  we  can  predict.  The  first  is  the  slope  of 
the  regression  line.  This  does  not  change  with  a  restriction  on  X.  The  fact  that  it  does  not 
change  is  reflected  by  Equation  4,  which  is  part  of  the  proof  of  theorem  1.  It  is  obvious  that 
a  greater  slope  leads  to  greater  chance  of  successfully  predicting  which  individual  will  have  the 
larger Y score. The second is $\sigma_E$. This does not change with a restriction in X. The fact that
it does not change is reflected in Equation 5, which is the other significant equation in the
proof of theorem 1. It is clear that a smaller $\sigma_E$ leads to a greater chance of picking the
correct individual. The way this is reflected in the equation

$$ \rho_{X,Y} = b\,\frac{\sigma_X}{\sigma_Y} $$

is that if $\sigma_E$ is made smaller without changing $\sigma_X$ or b, then $\sigma_Y$ will become smaller, and so
$\rho_{X,Y}$ becomes larger. The third factor is the
variance of X. It is clear that we have a better chance of choosing the correct individual if the X values of
randomly chosen people are spread out rather than being packed together. This is the factor that is depressed
as a result of selection. As already mentioned, for the type of selection in common use the variance of X* is
always smaller than the variance of X. Hence in the two-variable case, selection always causes
underestimation of the correlation coefficient if the correction formula is not used.

One  of  the  benefits  of  presenting  a  proof  of  theorem  1  and  then  discussing  how  the  three 
factors  affect  correlation  coefficients  is  that  one  can  see  how  correlation  estimation  depends 
on  linearity  and  homoscedasticity.  If  one  or  both  of  these  conditions  fail  drastically,  then  it  will 
be  very  difficult  to  get  a  decent  estimate.  A  few  comments  about  this  problem  will  be  made 
at  the  end  of  this  paper. 


IV.  THREE  VARIABLES  WITH  ONE  EXPLICITLY  RESTRICTED 


Let  X  be  the  explicit  selection  variable  and  let  Y  and  Z  be  incidental  selection  variables.  Y 
is  the  criterion  or  dependent  variable  and  X  and  Z  are  the  independent  or  predictor  variables. 

Assumption 1. The true regressions of Z on X, and Y on X, are linear.

Assumption  2.  The  variance  of  Y  given  X,  the  variance  of  Z  given  X,  and  the  covariance  of  Z  and  Y  given 
X,  do  not  depend  on  X. 

Theorem  2:  Under  assumptions  1  and  2 


This  correction  formula  is  slightly  more  complex  and  it  permits  the  construction  of  examples 
where  the  correlation  in  the  restricted  population  is  larger  than  the  correlation  in  the  unrestricted 
population.  Levin  (1972)  refers  to  these  cases  as  "Pseudo-Paradoxical."  The  terminology  probably 
stems from the fact that it is widely assumed in the literature that restriction always causes an
underestimate. This would certainly be expected on the basis of our discussion in the two-variable
case. Notice that with the formula appearing in theorem 2 and a little algebra it is easy to
characterize  these  "Pseudo-Paradoxical"  situations  in  the  three-variable  case.  In  these  cases  the 
uncorrected  estimation  is  an  overestimation.  Taking  examples  from  Levin  (1972),  the  correction 
procedure  was  applied  and  the  simulation  showed  each  time  that  the  corrected  value  was  very 
good.  It  seemed  that  the  corrected  estimate  was  slightly  low  in  each  case  but  the  estimate 
was  so  close  that  this  low  estimation  might  not  be  a  real  effect.  In  any  case  the  bias  appears 
to  be  so  slight  that  it  is  not  practically  significant.  The  interesting  fact  is  that  the  correction 
statistic  based  on  theorem  2  works  well  in  these  cases,  at  least  when  the  joint  distribution  of 
the  three  variables  is  multinormal. 
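
The three-variable correction with one explicit selection variable is commonly written in the form coded below (often called Thorndike's Case 3). Treating this form as the statement of theorem 2 is an assumption, and the function name is hypothetical. The usage lines take the population values that appear in the input file of Figure 1 (correlations 0.86, 0.0, and 0.43, selection X >= 0) and show a pseudo-paradoxical case: the restricted correlation between Y and Z exceeds the unrestricted one, and the correction brings it back down.

```python
import math


def correct_three_var(r_xy, r_xz, r_yz, variance_ratio):
    """Correct the Y-Z correlation when selection is on X only.

    r_xy, r_xz, r_yz: correlations observed in the selected group.
    variance_ratio:   var(X) in the applicant group / var(X) in the selected group.
    """
    d = variance_ratio - 1.0
    numerator = r_yz + r_xy * r_xz * d
    denominator = math.sqrt((1.0 + r_xy ** 2 * d) * (1.0 + r_xz ** 2 * d))
    return numerator / denominator


# Restricted-population values implied by selecting X >= 0 from a standard
# trivariate normal with rho_xy = 0.86, rho_xz = 0.0, rho_yz = 0.43.
var_x_sel = 1.0 - 2.0 / math.pi                # variance of X given X >= 0
var_y_sel = 1.0 - 0.86 ** 2 * (1.0 - var_x_sel)
r_xy_sel = 0.86 * var_x_sel / math.sqrt(var_x_sel * var_y_sel)
r_yz_sel = 0.43 / math.sqrt(var_y_sel)

print(r_yz_sel > 0.43)                          # restricted value is the larger one
print(correct_three_var(r_xy_sel, 0.0, r_yz_sel, 1.0 / var_x_sel))
```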




V.  THE  GENERAL  CASE 


Let  X  be  the  p-element  vector  of  explicit  selection  variables,  and  Y  the  n-p  element  vector 
of  incidental  selection  variables  on  the  applicant  group.  Y  will  contain  one  criterion  variable 
and  several  predictor  variables.  Then  X*  and  Y*  represent  the  explicit  and  incidental  selection 
variables  on  the  selected  group.  Let 


$$ V = \begin{pmatrix} V_{p,p} & V_{p,n-p} \\ V_{n-p,p} & V_{n-p,n-p} \end{pmatrix} $$

represent the variance-covariance matrix for X*, Y*. The first p rows and columns refer to the
components of X*. So $V_{p,p}$ is the variance-covariance matrix of X*, $V_{n-p,n-p}$ is the
variance-covariance matrix for Y*, $V_{p,n-p}$ gives the covariances between X* and Y*, and $V_{n-p,p}$
is the transpose of $V_{p,n-p}$. In this discussion, V refers to selected data and W refers to applicant
data.  In  our  application,  V  will  be  the  estimate  of  the  variance-covariance  of  all  tests  and  it  is 
based  on  selected  data.  The  restricted  population  consists  of  those  who  were  accepted  into 
the  organization  so  we  have  data  on  all  tests  for  these  people.  Let 


$$ W = \begin{pmatrix} W_{p,p} & W_{p,n-p} \\ W_{n-p,p} & W_{n-p,n-p} \end{pmatrix} $$

be the matrix of variance-covariance for the unselected data. We will estimate $W_{p,p}$ from the data since
we have data for the explicit selection variables on all applicants. $W_{p,n-p}$, $W_{n-p,p}$, and $W_{n-p,n-p}$ are
the matrices that we wish to know and will be given to us by the theorem. $W_{n-p,p}$ is, of course, the
transpose of $W_{p,n-p}$; so, we will just give an expression for $W_{p,n-p}$ when we state the theorem. The following
statement of the theorem is taken from Birnbaum, Paulson, and Andrews (1950).


Assumption 1: (Linearity) For each j the true regression of $Y_j$ on X is linear.

Assumption 2: (Homoscedasticity) The conditional variance-covariance matrix of Y given X does not
depend on X.

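Theorem 3: Under assumptions 1 and 2

$$ W_{p,n-p} = W_{p,p}\, V_{p,p}^{-1}\, V_{p,n-p} $$

and

$$ W_{n-p,n-p} = V_{n-p,n-p} - V_{n-p,p}\left(V_{p,p}^{-1} - V_{p,p}^{-1}\, W_{p,p}\, V_{p,p}^{-1}\right) V_{p,n-p}. $$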


Lawley  proved  this  theorem  in  1943  using  moment-generating  functions.  Both  of  the  earlier 
theorems  (1  and  2)  are  just  special  cases  of  theorem  3.  With  some  algebraic  manipulations 
the  reader  can  verify  this  by  writing  out  the  entries  of  the  matrix  and  comparing  them  with  the 
formulas  in  the  earlier  theorems.  Remember  that  the  matrices  of  Lawley’s  theorem  are 
variance-covariance  matrices  and  so  should  be  converted  to  correlation  coefficients  for  the 
purposes  of  comparison. 

Notice that the theorem says nothing about the types of restrictions that are allowed.
Restrictions  of  any  type  on  the  X  variables  will  preserve  the  linearity  and  homoscedasticity. 
However,  if  there  are  explicit  restriction  variables  that  are  not  known  and  hence  not  included 
in  the  equations  of  theorem  3,  then  the  accuracy  of  the  corrected  statistic  suffers.  The  conditions 
specified  in  Lawley’s  theorem  are  not  met  in  this  case. 
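
The matrix form of the correction can be sketched in a few lines. The code below is an illustration under stated assumptions, not the program's own routines: all function names are hypothetical, matrices are plain nested lists, and the small Gauss-Jordan inverse is written here only to keep the example self-contained. The explicit selection variables are assumed to occupy the first p rows and columns of V.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def mat_mul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mat_inv(a):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def lawley_correct(v, w_pp):
    """Return the full corrected matrix W from V (selected) and W_pp (applicants)."""
    p, n = len(w_pp), len(v)
    v_pp = [row[:p] for row in v[:p]]
    v_pq = [row[p:] for row in v[:p]]
    v_qp = transpose(v_pq)
    v_qq = [row[p:] for row in v[p:]]
    v_pp_inv = mat_inv(v_pp)
    w_pq = mat_mul(mat_mul(w_pp, v_pp_inv), v_pq)
    middle = mat_sub(v_pp_inv, mat_mul(mat_mul(v_pp_inv, w_pp), v_pp_inv))
    w_qq = mat_sub(v_qq, mat_mul(mat_mul(v_qp, middle), v_pq))
    w_qp = transpose(w_pq)
    top = [list(w_pp[i]) + list(w_pq[i]) for i in range(p)]
    bottom = [list(w_qp[i]) + list(w_qq[i]) for i in range(n - p)]
    return top + bottom


def cov_to_corr(c):
    sd = [math.sqrt(c[i][i]) for i in range(len(c))]
    return [[c[i][j] / (sd[i] * sd[j]) for j in range(len(c))] for i in range(len(c))]


# Selected-group covariance matrix for (X, Y, Z) after selecting X >= 0 from a
# standard trivariate normal with correlations 0.86, 0.0, and 0.43.
vx = 1.0 - 2.0 / math.pi
vy = 1.0 - 0.86 ** 2 * (1.0 - vx)
v = [[vx, 0.86 * vx, 0.0], [0.86 * vx, vy, 0.43], [0.0, 0.43, 1.0]]
w = lawley_correct(v, [[1.0]])
for row in cov_to_corr(w):
    print([round(value, 3) for value in row])
```

With p = 1 and one or two incidental variables, the same function reproduces the two-variable and three-variable corrections once the result is converted to correlations.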


VI. AN EXAMPLE


The  following  data  are  taken  from  Air  Force  performance  measurement  research  records. 
They  are  the  scores  (Maier  &  Sims,  1986)  on  the  10  subtests  of  the  enlistment  qualification 
battery  (variables  1-10)  and  a  performance  test  (variable  11)  (Green  &  Wing,  1988).  Table  1 
shows the correlations of these variables as observed in a sample from the restricted population.
Hence  they  are  not  corrected  for  range  restriction. 


Table  1.  Restricted  Data 


1.000 

0.143 

1.000 

0.568 

0.216 

0.381 

0.244 

0.537 

0.011 

0.225 

0.162 

0.112 

0.251 

0.200 

0.130 

0.235 

0.223 

0.371 

0.381 

0.271 

0.342 

0.172 

0.296 

0.308 

-.030 

0.161 

0.141 

0.393 

0.035 

0.185 

1.000 

0.157 

0.583 

0.254 

-.078 

-.042 

0.192 

0.244 

0.406 

-.057 

0.018 

-.158 

0.118 

0.110 

0.236 

1.000 

-.125 

1.000 

0.475 

0.220 

1.000 

0.451 

-.018 

0.283 

0.159 

0.298 

0.250 

1.000 

0.077 


1.000 


Table  2  shows  the  correlations  of  the  first  10  variables  (explicitly  restricted  variables)  as 
calculated  using  a  sample  from  the  unrestricted  population.  Pearson  correction  will  not  change 
these  values.  They  are  the  best  estimates  for  the  correlation  coefficients  between  the  explicitly 
restricted  variables.  Notice  that  some  estimates  have  changed  from  slightly  negative  to  significantly 
positive. 


Table  2.  Unrestricted  Data 


Variable      1      2      3      4      5      6      7      8      9     10
    1     1.000
    2     0.722  1.000
    3     0.801  0.708  1.000
    4     0.689  0.672  0.803  1.000
    5     0.524  0.627  0.617  0.608  1.000
    6     0.452  0.515  0.550  0.561  0.701  1.000
    7     0.637  0.533  0.529  0.423  0.306  0.225  1.000
    8     0.695  0.827  0.670  0.637  0.617  0.520  0.415  1.000
    9     0.695  0.684  0.593  0.521  0.408  0.336  0.741  0.600  1.000
   10     0.760  0.658  0.684  0.573  0.421  0.342  0.745  0.585  0.743  1.000

Table  3  shows  the  correlations  presented  in  Table  1  after  the  correction  procedure  has  been 
applied.  The  last  row  of  correlations  of  the  subtests  with  variable  11  have  now  been  corrected 
for  range  restriction.  It  is  seen  that  these  correlations  have  changed  considerably  in  the  process 
of  being  corrected  for  range  restriction.  These  corrected  values  are  the  best  available  estimates 
for  these  correlation  coefficients. 


Table 3. Corrected Data


Variable      1      2      3      4      5      6      7      8      9     10     11
    1     1.000
    2     0.722  1.000
    3     0.801  0.708  1.000
    4     0.689  0.672  0.803  1.000
    5     0.524  0.627  0.617  0.608  1.000
    6     0.452  0.515  0.550  0.561  0.701  1.000
    7     0.637  0.533  0.529  0.423  0.306  0.225  1.000
    8     0.695  0.827  0.670  0.637  0.617  0.520  0.415  1.000
    9     0.695  0.684  0.593  0.521  0.408  0.336  0.741  0.600  1.000
   10     0.760  0.658  0.684  0.573  0.421  0.342  0.745  0.585  0.743  1.000
   11     0.596  0.749  0.487  0.489  0.465  0.433  0.503  0.680  0.640  0.570  1.000


VII.  GENERAL  DESCRIPTION  OF  THE  SIMULATION  PROCESS 


The program was written in PASCAL and is currently running on an IBM-compatible
microcomputer.  The  joint  distribution  of  all  of  the  random  variables  is  assumed  to  be  multinormal 
in  the  unrestricted  population.  The  inputs  to  the  program  are  listed  here  for  reference  and  they 
will  be  explained  later  as  we  discuss  the  program. 


The number of variables [nv] and their names [vname]
Unrestricted population mean and std-dev [mu, sig]
The correlation coefficients in the unrestricted population [rho]
The number of explicitly selected variables [nve] (these are the first nve variables entered)
The number of restrictions [nr]
The coefficients of the explicitly selected variables [ncoeff]
Cutoff value for each restriction [cutoff]
Size of the unrestricted population [nwp]
Size of the restricted population [nvp]
The number of times the experiment will be repeated [reps]
The two variables of interest in the list of variables [int1, int2]


Figure 1 below is an example of a file describing the input to a run. The first line says
that there are 3 variables in this case. The next three lines give the names, means, and
standard deviations of the three variables. In this case they each have mean 0.0 and standard
deviation 1.0. The next three lines give the correlation matrix for the three variables. So the
coefficient for (x,y) is 0.86, for (x,z) it is 0.0, and for (y,z) it is 0.43. The next line gives the
number of explicit selection variables. There is 1 in this case and so X is the only explicit
selection variable. Then it is specified that there is only 1 restriction (selection) and that the
restriction is X ≥ 0.0.

The selected group will consist of those persons getting a score of zero or greater on the
X test. The second to last line says that the variables of interest are 2 and 3 (i.e., Y and Z).
Data and a histogram of the distribution will be given for the uncorrected r between Y and Z,
and the same information is given for the Pearson correction statistic. The program calculates
the Pearson correction statistic using theorem 3 from Section V. The last line will be explained
after the following discussion.


3
x  0.0  1.0
y  0.0  1.0
z  0.0  1.0
1.00  0.86  0.00
0.86  1.00  0.43
0.00  0.43  1.00
1  # of explicitly restricted variables
1  number of restrictions
1.0  0.0
2  3  variables of interest
1  50  100


Figure 1. An Input File.


Creating  a  multinormal  observation  is  equivalent  to  simulating  one  individual.  In  the  above 
case this means getting three values, one for each of the three test scores X, Y, and Z. Each
multinormal  observation  is  part  of  the  applicant  group  and  is  also  a  member  of  the  selected 
group  if  the  scores  satisfy  all  of  the  restrictions.  For  the  present  case  this  means  that  the  score 
on  the  X  test  must  be  at  least  zero.  One  experiment  is  simulated  by  generating  observations 
until  two  conditions  are  satisfied.  There  must  be  at  least  nwp  observations  in  the  applicant 
group  and  there  must  be  at  least  nvp  observations  in  the  selected  group.  For  most  cases  we 
set  nwp  =  1  and  then  the  only  restriction  is  that  we  have  at  least  nvp  observations  in  the 
selected  group.  One  run  of  the  program  consists  of  simulating  reps  experiments.  The  last  line 
of  a  file  which  describes  a  run  gives,  nwp,  nvp  and  reps  in  that  order.  In  Figure  1,  nwp  =  1, 
nvp = 50, and reps = 100.
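
The experiment loop just described can be sketched as follows. This is a simplified illustration and not the PASCAL program: it assumes only two standard normal variables, a single restriction X >= cutoff, Python's `random.gauss` as the normal generator, and the two-variable form of the correction. The names nwp, nvp, and reps follow the text; every function name is hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def run_experiment(rho, cutoff, nwp, nvp, rng):
    """Generate applicants until both size conditions hold, then estimate."""
    applicants_x, selected_x, selected_y = [], [], []
    while len(applicants_x) < nwp or len(selected_x) < nvp:
        x = rng.gauss(0.0, 1.0)
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        applicants_x.append(x)          # every observation is an applicant
        if x >= cutoff:                 # and is selected if it passes the cutoff
            selected_x.append(x)
            selected_y.append(y)
    r_uncorrected = pearson_r(selected_x, selected_y)
    if r_uncorrected == 0.0:
        return 0.0, 0.0
    ratio = sample_var(selected_x) / sample_var(applicants_x)
    magnitude = 1.0 / math.sqrt(1.0 + ratio * (1.0 / r_uncorrected ** 2 - 1.0))
    return r_uncorrected, math.copysign(magnitude, r_uncorrected)


def run(rho, cutoff, nwp, nvp, reps, seed=0):
    """Repeat the experiment reps times; return mean uncorrected and corrected r."""
    rng = random.Random(seed)
    results = [run_experiment(rho, cutoff, nwp, nvp, rng) for _ in range(reps)]
    mean_u = sum(u for u, _ in results) / reps
    mean_c = sum(c for _, c in results) / reps
    return mean_u, mean_c


mean_u, mean_c = run(rho=0.6, cutoff=0.0, nwp=1, nvp=50, reps=200, seed=1)
print("mean uncorrected r:", round(mean_u, 3))
print("mean corrected r:  ", round(mean_c, 3))
```

The sketch assumes nvp is at least 3 so that the sample variances and the correlation are defined.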

When  program  corr  begins,  it  will  ask  if  the  user  wants  to  enter  the  data  necessary  to 
describe  a  run  or  to  give  the  name  of  a  file  which  contains  the  data  in  the  expected  format. 
The  file  in  Figure  1  is  called  test4  and  so  we  can  just  give  that  name  to  corr  and  the  run  is 
specified  by  the  input  parameters  in  Figure  1.  The  reason  that  test4  is  in  the  expected  format 
is  that  corr  wrote  the  file  on  a  previous  run.  It  was  written  when  corr  executed  and  it  was 
specified  that  data  would  be  entered  from  the  keyboard  and  that  these  data  were  to  be  saved 
in  a  file  named  test4.  Now  if  one  were  familiar  with  PASCAL  read  statements,  they  could  use 
a  text  editor  to  change  some  of  the  parameters  and  use  test4  for  another  run.  After  corr 
executes,  the  data  necessary  to  produce  the  histograms  of  the  corrected  and  the  uncorrected 
statistics  are  in  two  internal  files  and  one  must  run  program  plot  which  will  read  these  internal 
files  and  display  these  data  on  the  printer. 

For  each  experiment  corr  calculates  each  of  the  following  quantities. 


b0 and b1 = the estimates of the regression parameters
statu = the uncorrected estimate of the correlation coefficient
statc = the corrected estimate of the correlation coefficient calculated
with the equations of theorem 3


Hence corr will generate reps copies of each of these parameters. In each case the two
implied variables are int1 and int2, and the regression parameters are for int2 on int1. In the
case of b0 and b1, the only values retained are the totals so that after the reps experiments
have been generated, the mean values of these parameters may be calculated. In the case
of statu and statc, each observed value is retained and written to the files pltu.dat and pltc.dat,
respectively. As mentioned earlier, the user can run plot to have all these results displayed.


VIII. PROGRAM METHODOLOGY


One  can  see  that  the  correction  procedure,  as  specified  in  theorem  3,  requires  taking  the 
inverse  of  a  matrix.  This  is  accomplished  with  the  Gauss-Jordan  matrix  inversion  algorithm  in 
unit  matops.  This  unit  also  contains  algorithms  to  multiply  and  to  subtract  matrices. 

Unit normgen includes all of the routines necessary to generate a multinormal observation
with the correlations specified in the input file. Suppose that there are nv variables. The first
step is to generate nv independent standard normal observations. This is accomplished by
repeated calls to Algorithm P in Knuth (1969). The desired multinormal distribution results from
taking a linear transformation of these independent standard normal observations. This
transformation is obtained by multiplying the independent observations and the matrix A which
is defined to be that unique matrix which is upper triangular with positive diagonal entries and
satisfies $A A^T = C$. In this
last equation, $A^T$ refers to the transpose of A, and C is the variance-covariance matrix of all
variables  in  the  unrestricted  population.  For  a  complete  discussion  of  this  procedure,  consult 
Shreider  (1966)  or  Johnson  (1987).  The  matrix  A  is  calculated  by  the  recursive  procedure  solve 
called  by  transpar  in  unit  normpar. 
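
A minimal sketch of this generation step is shown below. It departs from the text in two labeled ways: it uses the ordinary lower-triangular Cholesky factor L with L L^T = C instead of the upper-triangular matrix A, and it draws standard normals with Python's `random.gauss` instead of Knuth's Algorithm P. Either triangular factor yields observations with covariance C. The function names are hypothetical.

```python
import math
import random


def cholesky_lower(c):
    """Return lower-triangular L with L * L^T = C (C symmetric positive definite)."""
    n = len(c)
    factor = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(factor[i][k] * factor[j][k] for k in range(j))
            if i == j:
                diagonal = c[i][i] - partial
                if diagonal <= 0.0:
                    raise ValueError("matrix is not positive definite")
                factor[i][j] = math.sqrt(diagonal)
            else:
                factor[i][j] = (c[i][j] - partial) / factor[j][j]
    return factor


def multinormal_observation(mu, factor, rng):
    """Simulate one individual: mu + L z, with z independent standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(factor[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mu)]


# Covariance matrix of the three-variable example (unit variances).
c = [[1.00, 0.86, 0.00], [0.86, 1.00, 0.43], [0.00, 0.43, 1.00]]
factor = cholesky_lower(c)
rng = random.Random(0)
print(multinormal_observation([0.0, 0.0, 0.0], factor, rng))
```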


IX.  RECOMMENDATIONS 


Much  time  has  been  spent  in  writing  the  program;  hence,  most  of  these  recommendations 
have  to  do  with  proposed  applications  of  the  tool.  However,  based  on  a  limited  amount  of 
experimentation,  a  few  observations  seem  appropriate. 

The  correction  statistic  seems  to  work  well  under  the  conditions  of  the  theorem.  It  seems 
to  have  a  downward  bias  but,  for  the  cases  we  considered,  it  was  always  preferable  to  the 
uncorrected statistic. As can be seen from the proof of theorem 1, neither the corrected nor
the  uncorrected  statistic  will  be  accurate  if  the  joint  distribution  of  all  variables  fails  to  satisfy 
the  linearity  condition  or  the  homoscedasticity  condition.  After  fully  understanding  the  theorem, 
and  a  little  experimentation  with  the  simulation  program,  it  seems  likely  that  the  best  strategy 
is  to  always  use  the  Pearson  correction  statistic  instead  of  the  uncorrected  statistic. 

There  are  a  number  of  studies  that  could  be  pursued  with  the  use  of  the  simulation  program. 
A plot of the sampling distribution of the Fisher Z-transformation of the corrected statistic looked
approximately normal, as might be expected. The Z-transformation could form the basis for a
procedure that could be used to construct confidence intervals for the true correlation coefficient
based on the Pearson statistic. It might be instructive to modify the program slightly so as to
allow the joint distribution of all variables to be specified in the input. This would allow one to
test  the  confidence  intervals  procedure  using  actual  data  instead  of  stochastically  generated 
multinormal  data. 
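
One ingredient of such a procedure might look like the sketch below. It is only an illustration of working on the Z scale: the function name is hypothetical, the replicate values are made up for the example, and the spread of the transformed replicates is used as a stand-in for a standard error, which is an assumption and not an established result for the corrected statistic.

```python
import math
import statistics


def fisher_z_interval(replicates, z_crit=1.96):
    """Summarize simulated correlation estimates on the Fisher Z scale.

    replicates: corrected correlation estimates from repeated experiments.
    Returns a lower and an upper bound mapped back to the correlation scale.
    """
    transformed = [math.atanh(r) for r in replicates]
    center = statistics.mean(transformed)
    spread = statistics.stdev(transformed)
    return (math.tanh(center - z_crit * spread),
            math.tanh(center + z_crit * spread))


# Made-up replicate values, for illustration only.
low, high = fisher_z_interval([0.55, 0.60, 0.65, 0.58, 0.62])
print(round(low, 3), round(high, 3))
```

Because the bounds are formed on the Z scale and mapped back with tanh, they always stay inside the interval from -1 to 1.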

It  would  be  useful  to  know  how  much  accuracy  is  lost  in  the  corrected  statistic  when  one 
or  more  explicit  selection  variables  have  been  omitted  from  the  model.  Based  on  a  few 
experiments,  the  accuracy  of  the  corrected  statistic  is  diminished  by  the  omission  of  explicit 
selection  variables.  It  would  be  of  value  to  know  the  magnitude  of  this  effect.  This  is  important 
since  some  people  still  use  the  two-  or  three-variable  formulas  even  when  there  is  more  than 
one  explicit  selection  variable.  The  simulation  program  is  ideally  suited  to  answer  this  question. 